
Add predicate pushdown (updated) #78

Open
wants to merge 12 commits into base: main
Conversation

@mike-luabase commented Oct 16, 2024

This PR enhances query performance by filtering data at the metadata level, reducing the amount of data read during scans. It currently handles only queries that reference a single Iceberg table.

Key Changes

  • Extended IcebergManifestEntry:
    • Added lower_bounds and upper_bounds maps to store column statistics.
  • Utility Function:
    • Implemented IcebergUtils::GetFullPath to resolve file paths accurately.
  • Metadata Retrieval:
    • Added a GetEntries template in IcebergTable to fetch the relevant manifest entries, excluding deleted ones.
  • Predicate Evaluation:
    • Created EvaluatePredicateAgainstStatistics to assess whether data files can satisfy the query predicates based on their statistics (a sketch follows the summary below).
    • For each predicate:
      • Identifies the column involved.
      • Checks whether the column has defined lower and upper bounds.
      • Based on the comparison type (e.g., =, >, <), determines whether the predicate can be satisfied given the file's bounds.
      • If any predicate cannot possibly be satisfied, the file is excluded from the scan.
  • Scan Expression Modification:
    • Updated MakeScanExpression to filter data_file_entries using the predicates before scanning.
  • Binding Function Enhancements:
    • Enhanced IcebergScanBindReplace with additional logging and prepared the data file list based on the predicate results.

This implementation optimizes Iceberg table scans by leveraging metadata for early data filtering, significantly improving query efficiency and resource usage.
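
As a rough sketch of the pruning rule above (illustrative only: FileMightMatch, SimplePredicate, and the string-typed bounds are simplifications, not the PR's actual signatures; EvaluatePredicateAgainstStatistics works on DuckDB's filter expressions and the decoded Iceberg bound values), a file is excluded only when its bounds prove that no row can satisfy a predicate:

// Illustrative sketch of the pruning rule; not the PR's actual code.
#include <map>
#include <string>

enum class CompareType { EQUAL, GREATER_THAN, LESS_THAN };

struct SimplePredicate {
	std::string column;     // column the predicate refers to
	CompareType comparison; // =, > or <
	std::string constant;   // comparison constant
};

struct FileStats {
	std::map<std::string, std::string> lower_bounds; // per-column minimum
	std::map<std::string, std::string> upper_bounds; // per-column maximum
};

// Returns true if the file *might* contain matching rows; false only when
// the bounds prove that no row can satisfy the predicate.
static bool FileMightMatch(const FileStats &stats, const SimplePredicate &pred) {
	auto lower = stats.lower_bounds.find(pred.column);
	auto upper = stats.upper_bounds.find(pred.column);
	if (lower == stats.lower_bounds.end() || upper == stats.upper_bounds.end()) {
		// No statistics for this column: cannot prune, keep the file.
		return true;
	}
	switch (pred.comparison) {
	case CompareType::EQUAL:
		// A match is only possible if the constant lies within [lower, upper].
		return pred.constant >= lower->second && pred.constant <= upper->second;
	case CompareType::GREATER_THAN:
		// Values greater than the constant can exist only if upper > constant.
		return upper->second > pred.constant;
	case CompareType::LESS_THAN:
		// Values smaller than the constant can exist only if lower < constant.
		return lower->second < pred.constant;
	}
	return true; // be conservative for anything unhandled
}

A file is scanned only if this check passes for every predicate; if any predicate fails it, the file is dropped from data_file_entries before the scan expression is built.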

@mike-luabase changed the title from "Add predicate pushdown4" to "Add predicate pushdown (updated)" on Oct 16, 2024
select_statement->node = std::move(select_node);
return make_uniq<SubqueryRef>(std::move(select_statement), "iceberg_scan");
vector<Value> structs;
for (const auto &file : data_file_values) {
Contributor
The majority of lines changed here are still formatting changes - could you format using our clang-format config?

Author
@Mytherin fixed the formatting, sorry about that!

@samansmink (Collaborator)

Also @mike-luabase, you don't need to reopen the PR every time; that actually makes it harder to review, since we lose a clear view of the review comments made before.

If you want to overwrite commits, you can just force-push to your feature branch.

@mike-luabase (Author)

> Also @mike-luabase, you don't need to reopen the PR every time; that actually makes it harder to review, since we lose a clear view of the review comments made before.

Yes, understood, won't do that again.

@mike-luabase (Author)

This test is failing; looking into it:

SELECT count(*) FROM ICEBERG_SCAN('data/iceberg/generated_spec2_0_001/pyspark_iceberg_table');
Lower bounds is null
Upper bounds is null
Binder Error: Table "iceberg_scan_data" does not have a column named "filename"

@mike-luabase (Author)

@Mytherin test failure should be fixed now

@mike-luabase (Author)

@samansmink @Mytherin can we trigger the run again? This error doesn't seem related to my changes:

#7 [ 3/14] RUN apt-get install -y -qq software-properties-common
#7 0.232 E: Unable to locate package software-properties-common
#7 ERROR: process "/bin/sh -c apt-get install -y -qq software-properties-common" did not complete successfully: exit code: 100
------
 > [ 3/14] RUN apt-get install -y -qq software-properties-common:
0.232 E: Unable to locate package software-properties-common
------
Dockerfile:9
--------------------
   7 |     # Setup the basic necessities
   8 |     RUN apt-get update -y -qq
   9 | >>> RUN apt-get install -y -qq software-properties-common
  10 |     RUN apt-get install -y -qq --fix-missing ninja-build make gcc-multilib g++-multilib libssl-dev wget openjdk-8-jdk zip maven unixodbc-dev libc6-dev-i386 lib32readline6-dev libssl-dev libcurl4-gnutls-dev libexpat1-dev gettext unzip build-essential checkinstall libffi-dev curl libz-dev openssh-client pkg-config autoconf
  11 |     RUN apt-get install -y -qq ccache
--------------------
ERROR: failed to solve: process "/bin/sh -c apt-get install -y -qq software-properties-common" did not complete successfully: exit code: 100

@catkins commented Oct 18, 2024

@mike-luabase 💚 I'm so excited to see movement on this! I'll be able to delete a bunch of manual partition pruning + read_parquet code in an app of mine!

@mike-luabase (Author)

@samansmink anything I can do to help here?

@samansmink (Collaborator)

If you could re-add a PR description and address @Mytherin's comment from one of your previous PRs, #75 (comment), I will take a more detailed look and review this.

Please be aware that reviews take a lot of time, especially reviews of outside contributions that touch complicated code. To get your PRs through as quickly as possible, I recommend that you:

  • double-check your ideas for a PR with a core contributor of DuckDB through either the DuckDB Discord channel, a GitHub issue, or a GitHub discussion (see https://github.com/duckdb/duckdb/blob/main/CONTRIBUTING.md)
  • make sure that your PRs are easy to review:
    • no big PRs
    • don't mix big formatting changes into your PRs
    • write a clear description of what you are adding and why you chose the path you chose

@mike-luabase (Author)

@samansmink I added the description and fixed the formatting!

I understand this PR is on the larger side, but I think it's as small as it could be to get predicate pushdown working.

I'm more than happy to discuss the PR! I'll drop a note in the Discord.

@Mytherin (Contributor) left a comment

Thanks for reducing the changeset - this is easier to review. I still wonder about the changes to the Avro files. In addition, there are no tests included with the changeset. Could you add tests that stress-test the predicate pushdown in various ways (e.g. testing different predicate types, different data types, etc.)?


/* This code was generated by avrogencpp 1.13.0-SNAPSHOT. Do not edit.*/

#ifndef MANIFEST_ENTRY_HH_2043678367_H
Contributor
Could you expand a bit on why the avro/iceberg_types files are being changed?

Author

@Mytherin the existing types didn't have a way to access lower_bounds and upper_bounds, which we need for pushdown.
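
For reference, the addition amounts to carrying the per-column bounds on the manifest entry, roughly like this (illustrative shape only; the real structs are generated by avrogencpp from the Avro schema, and the map values hold the serialized bound bytes rather than plain strings):

// Illustrative shape of the extension; the generated code differs in detail.
#include <map>
#include <string>

struct IcebergManifestEntry {
	// ... existing fields such as status, content and file_path ...
	std::string file_path;

	// New: per-column statistics keyed by field id, used by the predicate
	// evaluation to decide whether a data file can be skipped.
	std::map<std::string, std::string> lower_bounds;
	std::map<std::string, std::string> upper_bounds;
};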

@mike-luabase (Author)

on it!

@mike-luabase (Author)

@Mytherin I added tests and updated the data generation for predicate pushdown. Let me know if that works!

@mike-luabase force-pushed the add-predicate-pushdown4 branch 3 times, most recently from eb8c753 to 136b5eb on October 27, 2024 at 12:12
@mike-luabase force-pushed the add-predicate-pushdown4 branch from 136b5eb to d9600d4 on October 27, 2024 at 12:24
@mike-luabase (Author)

I see files from /data/iceberg have somehow made it back into the PR. I'm trying to remove them, but not having much luck.

@samansmink (Collaborator)

As mentioned before, @mike-luabase why is this PR removing the checked in test data?

@peterboncz left a comment

Thanks for the efforts, Mike. I was looking through it, but I would not call it a review, and in any case I think the DuckDB Labs folks need to do that. I marked a number of diffs in the PR that I think could be removed: the deletion of the generated files (as already remarked by Sam) and some smaller things.

lower_bounds[std::to_string(lb.key)] = lb.value;
}
} else {
fprintf(stderr, "Lower bounds ISSSSS null\n");

ISSSSS?

is printing to stderr the right way to signal the absence of bounds info in a manifest?
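
One possible alternative, as a hypothetical sketch (the empty-map convention and the HasBoundsFor helper are assumptions, not code from this PR): leave the bounds maps empty when the manifest carries no statistics and let the pruning step simply keep such files, so no stderr output is needed:

// Hypothetical sketch, not the PR's code: absent bounds are represented as
// empty maps; a file without statistics for a column is never pruned on it.
#include <map>
#include <string>

struct FileBounds {
	std::map<std::string, std::string> lower_bounds; // keyed by field id
	std::map<std::string, std::string> upper_bounds;
};

static bool HasBoundsFor(const FileBounds &bounds, const std::string &field_id) {
	// Only prune when both bounds are present; otherwise the file is kept.
	return bounds.lower_bounds.count(field_id) != 0 &&
	       bounds.upper_bounds.count(field_id) != 0;
}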

@@ -1,4 +1,4 @@
-#!/usr/bin/python3
+#!/Users/mritchie712/opt/anaconda3/bin/python

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will break on other computers. In this file there are some useful changes (partitioning) but also spurious changes like semicolons and spacing. It would be nice to remove the spurious changes to make the diff as small as possible.

@@ -80,8 +80,8 @@ vector<IcebergManifestEntry> IcebergTable::ReadManifestEntries(const string &pat
}
} else {
auto schema = avro::compileJsonSchemaFromString(MANIFEST_ENTRY_SCHEMA);
avro::DataFileReader<c::manifest_entry> dfr(std::move(stream), schema);
c::manifest_entry manifest_entry;
avro::DataFileReader<manifest_entry> dfr(std::move(stream), schema);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are spurious, so this file change could be removed from the PR.
